[QEff. Finetune]: Added support to sync gradients across devices during optimizer step only. #477

quic-meetkuma · 2025-06-23T10:30:20Z

Disabling gradient synchronization is necessary when using gradient_accumulation_step > 1 with DDP enabled.
Currently, gradients are synced at every loss.backward() call, which runs at every step. With gradient accumulation, the weights are updated only at the opt.step() call, so the gradients across devices need to sync only during that step.
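
To make the cost concrete, the sketch below counts how many backward passes in one epoch would trigger a gradient all-reduce, with and without restricting the sync to optimizer steps. The function name, and the rule that the last batch of an epoch also triggers an optimizer step, are assumptions made for illustration; they are not taken from train_utils.py.

```python
def count_gradient_syncs(num_batches, gradient_accumulation_steps, sync_every_backward):
    """Count backward passes that would all-reduce gradients in one epoch."""
    syncs = 0
    for step in range(num_batches):
        # Assumed rule: the optimizer steps every N micro-batches and on the last batch.
        is_optimizer_step = (
            (step + 1) % gradient_accumulation_steps == 0 or step == num_batches - 1
        )
        if sync_every_backward or is_optimizer_step:
            syncs += 1
    return syncs


# 10 micro-batches with an accumulation window of 4.
before = count_gradient_syncs(10, 4, sync_every_backward=True)
after = count_gradient_syncs(10, 4, sync_every_backward=False)
print(before, after)
```

Under these assumptions, every sync that does not fall on an optimizer step is communication whose result is never consumed before the next accumulation.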

The with model.no_sync() context manager solves this issue.

Here, we do not use it; instead, we set ddp_model.require_backward_grad_sync to True or False depending on which step we are on.
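
A minimal sketch of the flag-toggling approach follows. FakeDDPModel is a hypothetical stand-in so the example runs without a process group; it only records the flag value seen at each step. With a real torch.nn.parallel.DistributedDataParallel instance, the flag would need to be set before the forward pass, since DDP reads it there. The loop structure and helper names are assumptions, not the code merged in this PR.

```python
class FakeDDPModel:
    """Hypothetical stand-in for a DDP-wrapped model; records the sync flag per step."""

    def __init__(self):
        self.require_backward_grad_sync = True
        self.sync_log = []

    def forward_backward(self, step):
        # Stand-in for forward + loss.backward(); a real DDP model would
        # all-reduce gradients here only when the flag is True.
        self.sync_log.append((step, self.require_backward_grad_sync))


def train_one_epoch(model, num_batches, gradient_accumulation_steps):
    optimizer_steps = []
    for step in range(num_batches):
        is_optimizer_step = (
            (step + 1) % gradient_accumulation_steps == 0 or step == num_batches - 1
        )
        # Toggle the flag before the forward pass of this micro-batch.
        model.require_backward_grad_sync = is_optimizer_step
        model.forward_backward(step)
        if is_optimizer_step:
            # Here a real loop would call optimizer.step() and optimizer.zero_grad().
            optimizer_steps.append(step)
    return optimizer_steps


model = FakeDDPModel()
steps = train_one_epoch(model, num_batches=6, gradient_accumulation_steps=3)
print(steps, model.sync_log)
```

Setting the attribute directly avoids nesting the forward and backward calls inside a conditional context manager, at the price of relying on an attribute that the no_sync() context manager otherwise manages.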

QEfficient/finetune/utils/train_utils.py

Signed-off-by: Meet Patel <[email protected]>

Signed-off-by: meetkuma <[email protected]>

…ng optimizer step only. (#477) Disabling gradient synchronization is necessary when using gradient_accumulation_step > 1 with DDP enabled. Currently, gradients are synced at every loss.backward() call, which runs at every step. With gradient accumulation, the weights are updated only at the opt.step() call, so the gradients across devices need to sync only during that step. The with model.no_sync() context manager solves this issue. Here, we do not use it; instead, we set ddp_model.require_backward_grad_sync to True or False depending on which step we are on. --------- Signed-off-by: Meet Patel <[email protected]> Signed-off-by: meetkuma <[email protected]> Signed-off-by: Amit Raj <[email protected]>

…ng optimizer step only. (quic#477) Disabling gradient synchronization is necessary when using gradient_accumulation_step > 1 with DDP enabled. Currently, gradients are synced at every loss.backward() call, which runs at every step. With gradient accumulation, the weights are updated only at the opt.step() call, so the gradients across devices need to sync only during that step. The with model.no_sync() context manager solves this issue. Here, we do not use it; instead, we set ddp_model.require_backward_grad_sync to True or False depending on which step we are on. --------- Signed-off-by: Meet Patel <[email protected]> Signed-off-by: meetkuma <[email protected]> Signed-off-by: Dhiraj Kumar Sah <[email protected]>

quic-meetkuma force-pushed the no_sync branch from 6b36ea7 to f5a350a Compare June 27, 2025 08:45

quic-meetkuma marked this pull request as ready for review June 27, 2025 08:50

quic-meetkuma requested review from ochougul, quic-amitraj, quic-hemagnih and quic-rishinr as code owners June 27, 2025 08:50

quic-meetkuma requested review from quic-akuruvil, quic-mamta, quic-swatia and vbaddi July 1, 2025 08:48

quic-meetkuma force-pushed the no_sync branch from f5a350a to e2a1d0b Compare July 3, 2025 05:22

quic-swatia reviewed Jul 7, 2025

QEfficient/finetune/utils/train_utils.py (outdated, resolved)

quic-meetkuma added 3 commits July 8, 2025 10:51

Updated training code to sync gradients only during backward step.

7e028c4

Signed-off-by: Meet Patel <[email protected]>

Fixed minor argument error.

fb04f8e

Signed-off-by: Meet Patel <[email protected]>

Changed is_backward_step var name to is_optimizer_step.

0d817e3

Signed-off-by: meetkuma <[email protected]>

quic-meetkuma force-pushed the no_sync branch from 30026c8 to 0d817e3 Compare July 8, 2025 05:21

quic-meetkuma added 2 commits July 8, 2025 10:54

Merged autocast and op verifier context managers.

b21927a

Signed-off-by: meetkuma <[email protected]>

Fixed code formatting error.

13066ed

Signed-off-by: meetkuma <[email protected]>

quic-meetkuma changed the title ~~[QEff. Finetune]: Added support to sync gradients across devices during backward step only.~~ [QEff. Finetune]: Added support to sync gradients across devices during optimizer step only. Jul 9, 2025

quic-mamta approved these changes Jul 9, 2025

quic-swatia approved these changes Jul 9, 2025

quic-swatia merged commit 3aaa2d8 into quic:main Jul 9, 2025
4 checks passed
